Nonlinear Transformation

Roadmap
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?

Lecture 11: Linear Models for Classification
binary classification via (logistic) regression;
multiclass via OVA/OVO decomposition

Lecture 12: Nonlinear Transformation
• Quadratic Hypotheses
• Nonlinear Transform
• Price of Nonlinear Transform
• Structured Hypothesis Sets

(4) How Can Machines Learn Better?

Nonlinear Transformation Quadratic Hypotheses

Linear Hypotheses

up to now: linear hypotheses $\mathcal{H}$
• visually: 'line'-like boundary
• mathematically: linear scores $s = \mathbf{w}^T \mathbf{x}$

but limited...
• theoretically: $d_{\text{VC}}$ under control :-)
• practically: on some $\mathcal{D}$, large $E_{\text{in}}$ for every line :-(

how to break the limit of linear hypotheses

Circular Separable

• $\mathcal{D}$ not linear separable
• but circular separable by a circle of radius $\sqrt{0.6}$ centered at origin:

$$h_{\text{SEP}}(\mathbf{x}) = \text{sign}\left(-x_1^2 - x_2^2 + 0.6\right)$$

re-derive Circular-PLA, Circular-Regression, blahblah ... all over again? :-)

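Circular Separable and Linear Separable

$$h(\mathbf{x}) = \text{sign}\Big(\underbrace{0.6}_{\tilde{w}_0} \cdot \underbrace{1}_{z_0} + \underbrace{(-1)}_{\tilde{w}_1} \cdot \underbrace{x_1^2}_{z_1} + \underbrace{(-1)}_{\tilde{w}_2} \cdot \underbrace{x_2^2}_{z_2}\Big) = \text{sign}\left(\tilde{\mathbf{w}}^T \mathbf{z}\right)$$

• $\{(\mathbf{x}_n, y_n)\}$ circular separable $\Longrightarrow$ $\{(\mathbf{z}_n, y_n)\}$ linear separable
• $\mathbf{x} \in \mathcal{X} \stackrel{\Phi}{\longmapsto} \mathbf{z} \in \mathcal{Z}$: (nonlinear) feature transform $\Phi$

[figure: the data in $\mathcal{X}$-space and in $\mathcal{Z}$-space]

circular separable in $\mathcal{X}$ $\Longrightarrow$ linear separable in $\mathcal{Z}$
vice versa?
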
Hsuan-Tien Lin (NTU CSIE) Machine Learning Foundations


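Linear Hypotheses in Z-Space

$$(z_0, z_1, z_2) = \mathbf{z} = \Phi(\mathbf{x}) = (1, x_1^2, x_2^2)$$

$$h(\mathbf{x}) = \tilde{h}(\mathbf{z}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right) = \text{sign}\left(\tilde{w}_0 + \tilde{w}_1 x_1^2 + \tilde{w}_2 x_2^2\right)$$

$\tilde{\mathbf{w}} = (\tilde{w}_0, \tilde{w}_1, \tilde{w}_2)$
• $(0.6, -1, -1)$: circle (◦ inside)
• $(-0.6, +1, +1)$: circle (◦ outside)
• $(0.6, -1, -2)$: ellipse
• $(0.6, -1, +2)$: hyperbola
• $(0.6, +1, +2)$: constant ◦ :-)

lines in $\mathcal{Z}$-space
$\Longleftrightarrow$ special quadratic curves in $\mathcal{X}$-space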
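
A minimal sketch (Python with NumPy, which is an assumption of this example and not part of the lecture) of how a single weight vector $\tilde{\mathbf{w}}$ on $\mathbf{z} = (1, x_1^2, x_2^2)$ acts as a quadratic boundary in $\mathcal{X}$-space. The helper names `phi` and `h` and the three probe points are hypothetical; the circle with ◦ outside is taken as $(-0.6, +1, +1)$.

```python
import numpy as np

def phi(x1, x2):
    """Phi(x) = (1, x1^2, x2^2): the restricted quadratic transform."""
    return np.array([1.0, x1 ** 2, x2 ** 2])

def h(w_tilde, x1, x2):
    """h(x) = sign(w~^T Phi(x)): a line in Z-space, a quadratic curve in X-space."""
    return int(np.sign(np.dot(w_tilde, phi(x1, x2))))

weights = {
    "circle (+ inside)":  (0.6, -1.0, -1.0),
    "circle (+ outside)": (-0.6, 1.0, 1.0),
    "ellipse":            (0.6, -1.0, -2.0),
    "hyperbola":          (0.6, -1.0, 2.0),
    "constant +":         (0.6, 1.0, 2.0),
}

# Probe points: the origin, one point far out on the x1-axis, one far out on the x2-axis.
probes = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
for name, w in weights.items():
    print(name, [h(np.array(w), *p) for p in probes])
```

Only the constant row is positive at all three probes; the hyperbola is positive at the origin and far out along the $x_2$-axis but negative far out along the $x_1$-axis.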
General Quadratic Hypothesis Set

a 'bigger' $\mathcal{Z}$-space with $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$

perceptrons in $\mathcal{Z}$-space $\Longleftrightarrow$ quadratic hypotheses in $\mathcal{X}$-space

$$\mathcal{H}_{\Phi_2} = \left\{ h(\mathbf{x}) \colon h(\mathbf{x}) = \tilde{h}(\Phi_2(\mathbf{x})) \text{ for some linear } \tilde{h} \text{ on } \mathcal{Z} \right\}$$

• can implement all possible quadratic curve boundaries:
  circle, ellipse, rotated ellipse, hyperbola, parabola, ...
  ellipse $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 = 1$
  $\Longleftarrow \tilde{\mathbf{w}}^T = [33, -20, -4, 3, 2, 3]$
• include lines and constants as degenerate cases

next: learn a good quadratic hypothesis $g$
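
The ellipse weights can be double-checked numerically. The sketch below (NumPy assumed; `phi2` and `ellipse_lhs_minus_rhs` are hypothetical helper names) compares $\tilde{\mathbf{w}}^T \Phi_2(\mathbf{x})$ with the expanded form $2(x_1 + x_2 - 3)^2 + (x_1 - x_2 - 4)^2 - 1$ on random points.

```python
import numpy as np

def phi2(x1, x2):
    """Phi_2(x) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

w_tilde = np.array([33.0, -20.0, -4.0, 3.0, 2.0, 3.0])

def ellipse_lhs_minus_rhs(x1, x2):
    """Zero exactly on the ellipse 2(x1 + x2 - 3)^2 + (x1 - x2 - 4)^2 = 1."""
    return 2 * (x1 + x2 - 3) ** 2 + (x1 - x2 - 4) ** 2 - 1

rng = np.random.default_rng(0)
points = rng.uniform(-5.0, 5.0, size=(1000, 2))
max_gap = max(
    abs(w_tilde @ phi2(a, b) - ellipse_lhs_minus_rhs(a, b)) for a, b in points
)
print("largest disagreement:", max_gap)
```

The two expressions agree up to floating-point rounding, so the boundary $\tilde{\mathbf{w}}^T \mathbf{z} = 0$ is the ellipse; with this sign convention the hypothesis is positive outside it.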
Fun Time

Using the transform $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, which of the following weights $\tilde{\mathbf{w}}^T$ in the $\mathcal{Z}$-space implements the parabola $2x_1^2 + x_2 = 1$?

(1) $[-1, 2, 1, 0, 0, 0]$
(2) $[0, 2, 1, 0, -1, 0]$
(3) $[-1, 0, 1, 2, 0, 0]$
(4) $[-1, 2, 0, 0, 0, 1]$

Reference Answer: (3)

Too simple, uh? :-) Flexibility to implement arbitrary quadratic curves opens new possibilities for minimizing $E_{\text{in}}$!

Nonlinear Transformation Nonlinear Transform

Good Quadratic Hypothesis

$\mathcal{Z}$-space                                     $\mathcal{X}$-space
perceptrons             $\Longleftrightarrow$   quadratic hypotheses
good perceptron         $\Longleftrightarrow$   good quadratic hypothesis
separating perceptron   $\Longleftrightarrow$   separating quadratic hypothesis

• want: get good perceptron in $\mathcal{Z}$-space
• known: get good perceptron in $\mathcal{X}$-space with data $\{(\mathbf{x}_n, y_n)\}$

todo: get good perceptron in $\mathcal{Z}$-space with data $\{(\mathbf{z}_n = \Phi_2(\mathbf{x}_n), y_n)\}$

The Nonlinear Transform Steps

(1) transform original data $\{(\mathbf{x}_n, y_n)\}$ to $\{(\mathbf{z}_n = \Phi(\mathbf{x}_n), y_n)\}$ by $\Phi$
(2) get a good perceptron $\tilde{\mathbf{w}}$ using $\{(\mathbf{z}_n, y_n)\}$ and your favorite linear classification algorithm $\mathcal{A}$
(3) return $g(\mathbf{x}) = \text{sign}\left(\tilde{\mathbf{w}}^T \Phi(\mathbf{x})\right)$
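
The three steps above can be sketched as follows, assuming PLA as the linear algorithm $\mathcal{A}$, $\Phi(\mathbf{x}) = (1, x_1^2, x_2^2)$ as the transform, and synthetic data labeled by the circle of radius $\sqrt{0.6}$. The data generator, the margin of 0.1 around the circle (used so that PLA halts quickly), and all function names are assumptions of this sketch, not part of the lecture.

```python
import numpy as np

def phi(X):
    """Feature transform Phi(x) = (1, x1^2, x2^2), applied to each row of X."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

def pla(Z, y, max_updates=10_000):
    """Plain perceptron learning algorithm on (Z, y); returns the weight vector."""
    w = np.zeros(Z.shape[1])
    for _ in range(max_updates):
        wrong = np.flatnonzero(y * (Z @ w) <= 0)  # mistakes under the current w
        if wrong.size == 0:
            break                                  # no mistakes left: halt
        n = wrong[0]
        w = w + y[n] * Z[n]                        # PLA correction step
    return w

def make_circular_data(n=200, seed=0, margin=0.1):
    """Points in [-1, 1]^2 labeled by sign(0.6 - x1^2 - x2^2), away from the circle."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    score = 0.6 - (X ** 2).sum(axis=1)
    keep = np.abs(score) > margin
    return X[keep], np.sign(score[keep])

X, y = make_circular_data()
Z = phi(X)                   # step (1): transform the data by Phi
w_tilde = pla(Z, y)          # step (2): linear algorithm A in Z-space

def g(X_new):                # step (3): g(x) = sign(w~^T Phi(x))
    return np.sign(phi(X_new) @ w_tilde)

e_in = np.mean(g(X) != y)
print("E_in(g) =", e_in)
```

Because the transformed data are linearly separable with a margin, PLA halts with zero in-sample error here; swapping `pla` for pocket or logistic regression would leave steps (1) and (3) unchanged.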







Nonlinear Model via Nonlinear $\Phi$ + Linear Models

Pandora's box :-):
can now freely do quadratic PLA, quadratic regression, cubic regression, ..., polynomial regression

Feature Transform $\Phi$

[figure: scatter plot with axis label 'Average Intensity']

not new, not just polynomial:
raw (pixels) $\xrightarrow{\text{domain knowledge}}$ concrete (intensity, symmetry)

the force, too good to be true? :-)
Fun Time

Consider the quadratic transform $\Phi_2(\mathbf{x})$ for $\mathbf{x} \in \mathbb{R}^d$ instead of in $\mathbb{R}^2$. The transform should include all different quadratic, linear, and constant terms formed by $(x_1, x_2, \ldots, x_d)$. What is the number of dimensions of $\mathbf{z} = \Phi_2(\mathbf{x})$?

(1) $d$
(2) $\frac{d^2}{2} + \frac{3d}{2} + 1$
(3) $d^2 + d + 1$
(4) $2^d$

Reference Answer: (2)

Number of different quadratic terms is $\binom{d}{2} + d$;
number of different linear terms is $d$;
number of different constant terms is 1.
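
The count can be confirmed by brute force. This sketch uses only the Python standard library; `count_phi2_terms` and `closed_form` are hypothetical names. It enumerates every distinct monomial of degree 0, 1, or 2 and compares the total with $\binom{d}{2} + d + d + 1$.

```python
from itertools import combinations_with_replacement

def count_phi2_terms(d):
    """Count distinct monomials of degree 0, 1, or 2 in d variables by enumeration."""
    return sum(
        1
        for degree in range(3)
        for _ in combinations_with_replacement(range(d), degree)
    )

def closed_form(d):
    """C(d, 2) + d quadratic terms, d linear terms, 1 constant term."""
    return d * (d - 1) // 2 + d + d + 1

for d in (1, 2, 3, 10):
    print(d, count_phi2_terms(d), closed_form(d))
```

For $d = 2$ both counts give 6, matching $\Phi_2(\mathbf{x}) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$.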
Nonlinear Transformation Price of Nonlinear Transform

Computation/Storage Price

$Q$-th order polynomial transform:
$$\Phi_Q(\mathbf{x}) = \big(1,\; x_1, x_2, \ldots, x_d,\; x_1^2, x_1 x_2, \ldots, x_d^2,\; \ldots,\; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\big)$$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions
= # ways of $\le Q$-combination from $d$ kinds with repetitions
= $\binom{Q+d}{Q} = \binom{Q+d}{d} = O(Q^d)$
= efforts needed for computing/storing $\mathbf{z} = \Phi_Q(\mathbf{x})$ and $\tilde{\mathbf{w}}$

$Q$ large $\Longrightarrow$ difficult to compute/store
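
To get a feel for the growth, the count $1 + \tilde{d} = \binom{Q+d}{d}$ can be tabulated directly. A small sketch using `math.comb` from the Python standard library; the function name `transform_dimension` and the chosen values of $Q$ and $d$ are illustrative assumptions.

```python
from math import comb

def transform_dimension(Q, d):
    """1 + d~ = C(Q + d, d): number of monomials of degree <= Q in d variables."""
    return comb(Q + d, d)

for Q in (1, 2, 3, 10, 50):
    print(Q, transform_dimension(Q, d=2), transform_dimension(Q, d=10))
```

With $d = 2$ the count for $Q = 2$ is 6, while $Q = 50$ already needs 1326 coordinates for every single $\mathbf{z}$.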





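Model Complexity Price

$Q$-th order polynomial transform:
$$\Phi_Q(\mathbf{x}) = \big(1,\; x_1, x_2, \ldots, x_d,\; x_1^2, x_1 x_2, \ldots, x_d^2,\; \ldots,\; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\big)$$

$\underbrace{1}_{\tilde{w}_0} + \underbrace{\tilde{d}}_{\text{others}}$ dimensions $= O(Q^d)$

• number of free parameters $\tilde{w}_i$ = $\tilde{d} + 1 \approx d_{\text{VC}}(\mathcal{H}_{\Phi_Q})$
• $d_{\text{VC}}(\mathcal{H}_{\Phi_Q}) \le \tilde{d} + 1$, why?
  any $\tilde{d} + 2$ inputs not shattered in $\mathcal{Z}$
  $\Longrightarrow$ any $\tilde{d} + 2$ inputs not shattered in $\mathcal{X}$

$Q$ large $\Longrightarrow$ large $d_{\text{VC}}$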





Generalization Issue

[figure: two classifiers on the same data, $\Phi_1$ (original $\mathbf{x}$) and $\Phi_4$]

which one do you prefer? :-)
• $\Phi_1$ 'visually' preferred
• $\Phi_4$: $E_{\text{in}}(g) = 0$ but overkill

(1) can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
(2) can we make $E_{\text{in}}(g)$ small enough?

trade-off:
$\tilde{d}$ ($Q$)  | (1)  | (2)
higher   | :-(  | :-D
lower    | :-D  | :-(

how to pick $Q$? visually, maybe?
Danger of Visual Choices

first of all, can you really 'visualize' when $\mathcal{X} = \mathbb{R}^{9}$? (well, I can't :-))

Visualize $\mathcal{X} = \mathbb{R}^2$
• full $\Phi_2$: $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, $d_{\text{VC}} = 6$
• or $\mathbf{z} = (1, x_1^2, x_2^2)$, $d_{\text{VC}} = 3$, after visualizing?
• or better $\mathbf{z} = (1, x_1^2 + x_2^2)$, $d_{\text{VC}} = 2$?
• or even better $\mathbf{z} = \left(\text{sign}(0.6 - x_1^2 - x_2^2)\right)$?
  careful about your brain's 'model complexity'

for VC-safety, $\Phi$ shall be decided without 'peeking' data

Fun Time

Consider the $Q$-th order polynomial transform $\Phi_Q(\mathbf{x})$ for $\mathbf{x} \in \mathbb{R}^2$. Recall that $\tilde{d} = \binom{Q+2}{2} - 1$. When $Q = 50$, what is the value of $\tilde{d}$?

(1) 1126
(2) 1325
(3) 2651
(4) 6211

Reference Answer: (2)

It's just a simple calculation, but shows you how $\tilde{d}$ becomes hundreds of times of $d = 2$ after the transform.

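Nonlinear Transformation Structured Hypothesis Sets

Polynomial Transform Revisited

$$\Phi_0(\mathbf{x}) = (1), \quad \Phi_1(\mathbf{x}) = \big(\Phi_0(\mathbf{x}),\; x_1, x_2, \ldots, x_d\big)$$
$$\Phi_2(\mathbf{x}) = \big(\Phi_1(\mathbf{x}),\; x_1^2, x_1 x_2, \ldots, x_d^2\big)$$
$$\Phi_3(\mathbf{x}) = \big(\Phi_2(\mathbf{x}),\; x_1^3, x_1^2 x_2, \ldots, x_d^3\big)$$
$$\cdots$$
$$\Phi_Q(\mathbf{x}) = \big(\Phi_{Q-1}(\mathbf{x}),\; x_1^Q, x_1^{Q-1} x_2, \ldots, x_d^Q\big)$$

structure: nested $\mathcal{H}_i$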
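
The recursive definition translates almost literally into code. A minimal sketch, assuming NumPy and a hypothetical helper `phi_q`: each $\Phi_Q(\mathbf{x})$ is $\Phi_{Q-1}(\mathbf{x})$ followed by all degree-$Q$ monomials, so the lower-order features always form a prefix of the higher-order ones.

```python
from itertools import combinations_with_replacement
import numpy as np

def phi_q(x, Q):
    """Phi_Q(x), built recursively as Phi_{Q-1}(x) followed by all degree-Q monomials."""
    x = np.asarray(x, dtype=float)
    if Q == 0:
        return np.array([1.0])
    degree_q_terms = [
        np.prod(x[list(idx)])  # e.g. idx = (0, 0, 1) gives x1 * x1 * x2
        for idx in combinations_with_replacement(range(len(x)), Q)
    ]
    return np.concatenate([phi_q(x, Q - 1), degree_q_terms])

x = np.array([2.0, 3.0])
z2 = phi_q(x, 2)   # (1, x1, x2, x1^2, x1*x2, x2^2)
z3 = phi_q(x, 3)   # the same six entries, then the four cubic terms
print(z2)
print(z3)
```

Since `z3` starts with `z2`, any linear hypothesis on $\Phi_2$ is also available on $\Phi_3$ by setting the extra weights to zero, which is the nesting of the hypothesis sets.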
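Structured Hypothesis Sets

[figure: nested sets $\mathcal{H}_0$, $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$, ...]

Let $g_i = \text{argmin}_{h \in \mathcal{H}_i} E_{\text{in}}(h)$:

$\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots$
$d_{\text{VC}}(\mathcal{H}_0) \le d_{\text{VC}}(\mathcal{H}_1) \le d_{\text{VC}}(\mathcal{H}_2) \le d_{\text{VC}}(\mathcal{H}_3) \le \cdots$
$E_{\text{in}}(g_0) \ge E_{\text{in}}(g_1) \ge E_{\text{in}}(g_2) \ge E_{\text{in}}(g_3) \ge \cdots$

[figure: error versus VC dimension $d_{\text{VC}}$, with curves for in-sample error, model complexity, and out-of-sample error, and $d_{\text{VC}}^*$ marked on the axis]

use $\mathcal{H}_{1126}$ won't be good! :-(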
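
The in-sample half of this picture can be reproduced on a toy problem. Note the swap: the sketch below uses least-squares regression with squared error on one-dimensional inputs, not the 0/1 classification error of the slides, because the minimizer over each nested set is then available in closed form. The noisy sine target, the sample size, and the name `e_in` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)  # hypothetical noisy target

def e_in(Q):
    """Smallest in-sample squared error achievable with Phi_Q(x) = (1, x, ..., x^Q)."""
    Z = np.vander(x, Q + 1, increasing=True)   # rows are (1, x, x^2, ..., x^Q)
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares fit in Z-space
    return float(np.mean((Z @ w - y) ** 2))

errors = [e_in(Q) for Q in range(6)]
for Q, err in enumerate(errors):
    print(Q, err)
```

The printed errors never increase as $Q$ grows, mirroring $E_{\text{in}}(g_0) \ge E_{\text{in}}(g_1) \ge \cdots$; the sketch says nothing about the out-of-sample error.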


Linear Model First

[figure: error versus VC dimension $d_{\text{VC}}$, with curves for in-sample error, model complexity, and out-of-sample error]

• tempting sin: use $\mathcal{H}_{1126}$, low $E_{\text{in}}(g_{1126})$ to fool your boss
  really? :-( a dangerous path of no return
• safe route: $\mathcal{H}_1$ first
  • if $E_{\text{in}}(g_1)$ good enough, live happily thereafter :-)
  • otherwise, move right of the curve
    with nothing lost except 'wasted' computation

linear model first:
simple, efficient, safe, and workable!
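
The safe route can be written as a loop that starts at $Q = 1$ and raises $Q$ only when $E_{\text{in}}$ is not yet good enough. As a hedged illustration it again swaps in least-squares regression on one-dimensional inputs; the tolerance, the cap `max_q`, the two noiseless targets, and the function names are all assumptions of this sketch.

```python
import numpy as np

def fit_e_in(x, y, Q):
    """In-sample squared error of the least-squares fit on (1, x, ..., x^Q)."""
    Z = np.vander(x, Q + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.mean((Z @ w - y) ** 2))

def simplest_good_q(x, y, tolerance, max_q=10):
    """Start from the linear model (Q = 1); raise Q only when E_in is not good enough."""
    for Q in range(1, max_q + 1):
        if fit_e_in(x, y, Q) <= tolerance:
            return Q
    return max_q

x = np.linspace(-1.0, 1.0, 21)
y_linear = 3.0 * x - 1.0          # a line already explains this target
y_quadratic = 1.0 - 2.0 * x ** 2  # a line cannot; one step right on the curve can

print(simplest_good_q(x, y_linear, tolerance=1e-8))
print(simplest_good_q(x, y_quadratic, tolerance=1e-8))
```

The loop stops at $Q = 1$ for the linear target and at $Q = 2$ for the quadratic one, so the only cost of trying the linear model first is the one extra fit.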


Fun Time

Consider two hypothesis sets, $\mathcal{H}_1$ and $\mathcal{H}_{1126}$, where $\mathcal{H}_1 \subset \mathcal{H}_{1126}$. Which of the following relationships between $d_{\text{VC}}(\mathcal{H}_1)$ and $d_{\text{VC}}(\mathcal{H}_{1126})$ is not possible?

(1) $d_{\text{VC}}(\mathcal{H}_1) = d_{\text{VC}}(\mathcal{H}_{1126})$
(2) $d_{\text{VC}}(\mathcal{H}_1) \ne d_{\text{VC}}(\mathcal{H}_{1126})$
(3) $d_{\text{VC}}(\mathcal{H}_1) < d_{\text{VC}}(\mathcal{H}_{1126})$
(4) $d_{\text{VC}}(\mathcal{H}_1) > d_{\text{VC}}(\mathcal{H}_{1126})$

Reference Answer: (4)

Every input combination that $\mathcal{H}_1$ shatters can be shattered by $\mathcal{H}_{1126}$, so $d_{\text{VC}}$ cannot decrease.

Summary
(1) When Can Machines Learn?
(2) Why Can Machines Learn?
(3) How Can Machines Learn?

Lecture 11: Linear Models for Classification
Lecture 12: Nonlinear Transformation
• Quadratic Hypotheses
  linear hypotheses on quadratic-transformed data
• Nonlinear Transform
  happy linear modeling after $\mathcal{Z} = \Phi(\mathcal{X})$
• Price of Nonlinear Transform
  computation/storage/[model complexity]
• Structured Hypothesis Sets
  linear/simpler model first

• next: dark side of the force :-)

(4) How Can Machines Learn Better?